In this magnificent paper, Professor Koltchinskii offers general and
powerful performance bounds for empirical risk minimization, a fundamental
principle of statistical learning theory. Since the elegant pioneering work of
Vapnik and Chervonenkis in the early 1970s, various such bounds have been
known that relate the performance of empirical risk minimizers to
combinatorial and geometrical features of the class over which the
minimization is performed. This area of research has been a rich source of
motivation and a major field of applications of empirical process theory.

The appearance of advanced concentration inequalities in the 1990s,
primarily thanks to Talagrand's influential work, provoked major advances in
both empirical process theory and statistical learning theory and led to a
much deeper understanding of some of the basic phenomena. In the
discussed paper Professor Koltchinskii develops a powerful new methodology,
iterative localization, which, with the help of concentration inequalities, is
able to explain most of the recent results and go significantly beyond them
in many cases.

The main motivation behind Professor Koltchinskii's paper is based on
classical problems of statistical learning theory such as binary classification
and regression in which, given a sample $(X_i, Y_i)$, $i = 1, \ldots, n$, of
independent and identically distributed pairs of random variables (where the
$X_i$ take their values in some feature space $\mathcal{X}$ and the $Y_i$ are,
say, real-valued), the goal is to find a function
$f : \mathcal{X} \to \mathbb{R}$ whose risk, defined in terms of the expected
value of an appropriately chosen loss function, is as small as possible.

In the remaining part of this discussion we point out how the performance
bounds of Professor Koltchinskii's paper can be used to study a seemingly
different model, motivated by nonparametric ranking problems, which has
received increasing attention in both the statistical and the machine learning
literature. Indeed, in several applications, such as the search engine problem
or credit risk screening, the goal is to learn how to rank, or to score,
observations rather than just to classify them. In this case, performance
measures involve pairs of observations, as can be seen, for instance, with
the AUC (Area Under an ROC Curve) criterion. In this setting, empirical
risk functionals are no longer simple averages of i.i.d. random variables.
Despite this fact, we will see in the sequel that it is possible to apply
Professor Koltchinskii's approach. Once again, concentration inequalities
will play a major role.

In a ranking problem one has to compare two different observations and
decide which one is "better." To formalize the problem in a simple model,
let $(X, Y)$ be a pair of random variables taking values in
$\mathcal{X} \times \mathbb{R}$, where $\mathcal{X}$ is a measurable space.
The random object $X$ models some observation and $Y$ its real-valued label.
Let $(X', Y')$ denote a pair of random variables identically distributed with
$(X, Y)$, and independent of it. In the ranking problem one observes $X$ and
$X'$ but not necessarily their labels $Y$ and $Y'$. We think of $X$ as being
"better" than $X'$ if $Y > Y'$. The goal is to rank $X$ and $X'$ such that the
probability that the better ranked of them has a smaller label is as small as
possible. Formally, a ranking rule is a function
$r : \mathcal{X} \times \mathcal{X} \to \{-1, 1\}$. If $r(x, x') = 1$, then the
rule ranks $x$ higher than $x'$. The performance of a ranking rule is measured
by the ranking risk

\[
L(r) = \mathbb{P}\{(Y - Y') \cdot r(X, X') < 0\},
\]

that is, the probability that $r$ ranks two randomly drawn instances
incorrectly. Clearly, the problem is equivalent to a binary classification
problem in which the sign of the random variable $Y - Y'$ is to be guessed
based upon the pair of observations $(X, X')$.

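Now assume that $n$ independent, identically distributed copies of $(X, Y)$
are available: $(X_1, Y_1), \ldots, (X_n, Y_n)$. Given a ranking rule $r$, one
may use the training data to estimate its risk
$L(r) = \mathbb{P}\{(Y - Y') \cdot r(X, X') < 0\}$.
Perhaps the most natural estimate is the $U$-statistic

\[
L_n(r) = \frac{1}{n(n-1)} \sum_{i \neq j}
  \mathbb{I}\bigl[(Y_i - Y_j) \cdot r(X_i, X_j) < 0\bigr].
\]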
By following the empirical risk minimization strategy studied in the
discussed paper, one may consider minimizers of the empirical estimate
$L_n(r)$ over a class $\mathcal{R}$ of ranking rules
$r : \mathcal{X} \times \mathcal{X} \to \{-1, 1\}$ and study the performance
of such empirically selected ranking rules. Define the empirical risk
minimizer, over $\mathcal{R}$, by

\[
r_n = \operatorname*{arg\,min}_{r \in \mathcal{R}} L_n(r).
\]
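For a finite class of ranking rules, empirical ranking risk minimization amounts to evaluating the pairwise empirical risk of each candidate and keeping the smallest. The sketch below illustrates this under stated assumptions: the class consists of three hypothetical rules induced by scoring functions on two-dimensional observations, and the data are made up for illustration only.

```python
from itertools import permutations


def empirical_ranking_risk(rule, X, Y):
    """U-statistic estimate: share of ordered pairs i != j ranked incorrectly."""
    n = len(Y)
    wrong = sum(
        1
        for i, j in permutations(range(n), 2)
        if (Y[i] - Y[j]) * rule(X[i], X[j]) < 0
    )
    return wrong / (n * (n - 1))


def rule_from_score(score):
    """Hypothetical helper: r(x, x') = 1 if score(x) >= score(x'), else -1."""
    return lambda x, x_prime: 1 if score(x) >= score(x_prime) else -1


def minimize_empirical_ranking_risk(rules, X, Y):
    """Return the index of a rule with the smallest empirical risk, plus all risks."""
    risks = [empirical_ranking_risk(rule, X, Y) for rule in rules]
    best = min(range(len(rules)), key=lambda k: risks[k])
    return best, risks


# Illustrative data: the label grows with the first coordinate.
X = [(0.1, 0.9), (0.4, 0.2), (0.7, 0.5), (0.9, 0.1)]
Y = [1.0, 2.0, 3.5, 4.0]

candidates = [
    rule_from_score(lambda x: x[0]),   # rank by first coordinate
    rule_from_score(lambda x: x[1]),   # rank by second coordinate
    rule_from_score(lambda x: -x[0]),  # rank by first coordinate, reversed
]
best, risks = minimize_empirical_ranking_risk(candidates, X, Y)
```

Here the rule that ranks by the first coordinate attains zero empirical risk and is selected. For infinite classes the minimization is a genuine optimization problem, and exhaustive search as above no longer applies.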
In a way analogous to how ordinary empirical risk minimization leads to
questions best attacked by using tools of empirical process theory, the key
to bounding the performance of an empirical minimizer of the ranking risk
lies in investigating the properties of $U$-processes. For a detailed and
modern account of $U$-process theory we refer to the excellent book of
de la Peña and Giné [4]. Interestingly, however, bounding the performance of
empirical ranking risk minimization boils down to ordinary empirical risk
minimization, as is pointed out in the sequel.

We are interested in the risk $L(r_n)$ of the empirical minimizer $r_n$, when
compared to $L^*$, the risk of the best possible ranking function $r^*$ in the
sense that $L^* = L(r^*) \le L(r)$ for all measurable ranking functions $r$.

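Set first

\[
q_r\bigl((x, y), (x', y')\bigr)
  = \mathbb{I}\bigl[(y - y') \cdot r(x, x') < 0\bigr]
  - \mathbb{I}\bigl[(y - y') \cdot r^*(x, x') < 0\bigr]
\]

and consider the following estimate of the excess ranking risk
$\Lambda(r) = L(r) - L^* = \mathbb{E}\, q_r\bigl((X, Y), (X', Y')\bigr)$:

\[
\Lambda_n(r) = \frac{1}{n(n-1)} \sum_{i \neq j}
  q_r\bigl((X_i, Y_i), (X_j, Y_j)\bigr),
\]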

which is a $U$-statistic of degree 2 with symmetric kernel $q_r$. Clearly, the
minimizer $r_n$ of the empirical ranking risk $L_n(r)$ over $\mathcal{R}$ also
minimizes the empirical excess risk $\Lambda_n(r)$. To study this minimizer,
consider the Hoeffding decomposition [5] of $\Lambda_n(r)$:

\[
\Lambda_n(r) - \Lambda(r) = 2\,T_n(r) + W_n(r),
\]

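where

\[
T_n(r) = \frac{1}{n} \sum_{i=1}^n h_r(X_i, Y_i)
\]

is a sum of i.i.d. random variables with

\[
h_r(x, y) = \mathbb{E}\, q_r\bigl((x, y), (X', Y')\bigr) - \Lambda(r)
\]

and

\[
W_n(r) = \frac{1}{n(n-1)} \sum_{i \neq j}
  \hat h_r\bigl((X_i, Y_i), (X_j, Y_j)\bigr)
\]

is a degenerate $U$-statistic with symmetric kernel

\[
\hat h_r\bigl((x, y), (x', y')\bigr)
  = q_r\bigl((x, y), (x', y')\bigr) - \Lambda(r) - h_r(x, y) - h_r(x', y').
\]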
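As a consistency check on the normalizations, one can substitute the definition of the degenerate kernel back into the $U$-statistic. The short derivation below is a sketch under two assumptions: the degenerate part is normalized by $1/(n(n-1))$ over pairs $i \neq j$, and its kernel is written $\hat h_r$ (a notation chosen here).

```latex
\begin{align*}
\Lambda_n(r)
 &= \frac{1}{n(n-1)} \sum_{i \neq j} q_r\bigl((X_i,Y_i),(X_j,Y_j)\bigr) \\
 &= \frac{1}{n(n-1)} \sum_{i \neq j}
    \Bigl[ \hat h_r\bigl((X_i,Y_i),(X_j,Y_j)\bigr)
           + \Lambda(r) + h_r(X_i,Y_i) + h_r(X_j,Y_j) \Bigr] \\
 &= W_n(r) + \Lambda(r) + \frac{2(n-1)}{n(n-1)} \sum_{i=1}^n h_r(X_i,Y_i) \\
 &= \Lambda(r) + 2\,T_n(r) + W_n(r).
\end{align*}
% Degeneracy: for every fixed (x, y),
\begin{align*}
\mathbb{E}\,\hat h_r\bigl((x,y),(X',Y')\bigr)
 &= \bigl(h_r(x,y) + \Lambda(r)\bigr) - \Lambda(r) - h_r(x,y)
    - \mathbb{E}\,h_r(X',Y') = 0,
\end{align*}
% since E h_r(X', Y') = E q_r((X,Y),(X',Y')) - \Lambda(r) = 0.
```

Each index appears in $n-1$ ordered pairs as the first element and in $n-1$ as the second, which is where the factor 2 in front of the i.i.d. average comes from.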

The main message of this note is that one can show that, under very
general circumstances, the contribution of the degenerate part $W_n(r)$ of the
$U$-statistic is negligible compared to that of $T_n(r)$. This means that
minimization of $\Lambda_n$ is approximately equivalent to minimizing
$T_n(r)$. But since $T_n(r)$ is an average of i.i.d. random variables, this can
be studied by techniques worked out for empirical risk minimization, such as
the ones in the discussed paper.

The main tool for handling the degenerate part is a new general moment
inequality for $U$-processes established in [3], based mostly on concentration
and moment inequalities for empirical processes, Rademacher averages, and
Rademacher chaos, developed in [1], as well as decoupling and randomization
techniques; see [4]. In order to recall this result, we need to introduce some
quantities related to the class $\mathcal{R}$. Let
$\varepsilon_1, \ldots, \varepsilon_n$ be i.i.d. Rademacher random
variables independent of the $(X_i, Y_i)$. Let



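\[
Z_\varepsilon = \sup_{r \in \mathcal{R}}
  \Bigl| \sum_{i,j} \varepsilon_i \varepsilon_j
  \hat h_r\bigl((X_i, Y_i), (X_j, Y_j)\bigr) \Bigr|,
\]

\[
U_\varepsilon = \sup_{r \in \mathcal{R}} \; \sup_{\alpha : \|\alpha\|_2 \le 1}
  \sum_{i,j} \varepsilon_i \alpha_j
  \hat h_r\bigl((X_i, Y_i), (X_j, Y_j)\bigr),
\]

\[
M = \sup_{r \in \mathcal{R},\; k = 1, \ldots, n}
  \Bigl| \sum_{i=1}^n \varepsilon_i
  \hat h_r\bigl((X_i, Y_i), (X_k, Y_k)\bigr) \Bigr|.
\]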



It is shown in [3] that there exists a universal constant $C$ such that, with
probability at least $1 - \delta$,



This inequality bounds the degenerate part of the $U$-process in terms of
expected values of certain Rademacher averages and chaoses indexed by
$\mathcal{R}$. These quantities have been thoroughly studied and are well
understood, and may be easily bounded in many interesting cases. For example,
it is not difficult to see that if $\mathcal{R}$ is a class of ranking
functions of finite VC-dimension $V$, then the right-hand side above is of the
order of $(V + \log(1/\delta))/n$.

Whenever $\sup_{r \in \mathcal{R}} |W_n(r)|$ is small, by Hoeffding's
decomposition, the minimization of the empirical ranking risk is approximately
equivalent to the minimization of the ordinary empirical process $T_n(r)$.
This can be studied by the rich theory developed in Professor Koltchinskii's
paper. For example, the variance-control techniques mentioned in Section 7 can
be applied in this case to derive fast rates of convergence. In particular, it
is pointed out in [3] that if there exist constants $c > 0$ and
$\alpha \in [0, 1]$ such that for all $r \in \mathcal{R}$,

\[
\operatorname{Var}\bigl(h_r(X, Y)\bigr) \le c\,\Lambda(r)^{\alpha},
\]

then fast rates of convergence may be achieved, depending on the value of
$\alpha$ and the modulus of continuity of the empirical process

\[
\nu_n(r) = \frac{1}{n} \sum_{i=1}^n \ell\bigl(r, (X_i, Y_i)\bigr) - L(r),
\]

where

\[
\ell\bigl(r, (x, y)\bigr)
  = 2\,\mathbb{E}\,\mathbb{I}\bigl[(y - Y) \cdot r(x, X) < 0\bigr] - L(r).
\]

For some explicit performance bounds involving these conditions, we refer
to [3]. Here we just recall two simple corollaries, derived in [2] in the case
when the class $\mathcal{R}$ has a finite VC-dimension $V$.
Assume that

• either: $Y \in \{-1, 1\}$ is binary-valued and
  $\eta(x) = \mathbb{P}\{Y = 1 \mid X = x\}$ is such that the random variable
  $\eta(X)$ has an absolutely continuous distribution on $[0, 1]$ with a
  density bounded by $B$;

• or: $Y = m(X) + \sigma(X) N$ for some (unknown) functions
  $m : \mathcal{X} \to \mathbb{R}$ and $\sigma : \mathcal{X} \to \mathbb{R}$,
  where $N$ is a standard normal random variable, independent of $X$, such
  that $m(X)$ has a bounded density and the conditional variance $\sigma(x)$
  is bounded over $\mathcal{X}$.

Then there is a constant $C$ such that for every $\delta, \varepsilon \in (0, 1)$,
the excess ranking risk of the empirical ranking minimizer satisfies, with
probability at least $1 - \delta$,

\[
L(r_n) - L^* \le 2 \Bigl( \inf_{r \in \mathcal{R}} L(r) - L^* \Bigr)
  + C \left( \frac{V \log(n/\delta)}{n} \right)^{1/(1+\varepsilon)}.
\]

Professor Koltchinskii's paper raises many interesting questions about how
his new sharp results can be used to prove improved performance bounds
for ranking problems. For example, obtaining sharp data-dependent upper
confidence bounds so crucial for penalized model selection remains a
challenge. We expect that once again, concentration inequalities will be the
key for obtaining powerful oracle inequalities for ranking problems.